Estimating the maximal margin hyperplane corresponds to maximising (believe me, or work your way through Sidharth’s page)
\[\sum_i{\alpha_i} - \frac12 \sum_i\sum_k\alpha_i\alpha_ky_iy_k\mathbf{x}_i^T\mathbf{x}_k\] where \(\mathbf{x}_i, \mathbf{x}_k\) are two \(p\)-dimensional data vectors, and the coefficients of the separating hyperplane
\[\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = 0\]
are computed from the support vectors and the weights \(\alpha_i\) from the optimisation as follows:
\[\mathbf{\beta}_j=\sum_{i=1}^s (\alpha_iy_i)\mathbf{x}_{ij}\]
Now, \(i, k = 1, ..., n\), but only the support vectors contribute to \(\beta\): \(\alpha_i = 0\) for every other observation, so we can sum over \(1, ..., s\) only, where \(s\) is the number of support vectors.
With the kernel trick, the inner product is replaced by a kernel function, and the quantity to maximise becomes
\[\sum_i{\alpha_i} - \frac12 \sum_i\sum_k\alpha_i\alpha_ky_iy_kK(\mathbf{x}_i, \mathbf{x}_k)\]
Try one kernel function transformation to show that \(K(\mathbf{x}_i, \mathbf{x}_k) = \psi(\mathbf{x}_i)^T\psi(\mathbf{x}_k)\). You can think of \(\psi()\) as transformations of the predictors, \(\mathbf{x}\).
Fill in the steps to go from the first line to the last. Note that \(p=2\). (You can find all the steps in the lecture notes.)
\[\begin{align*} \mathcal{K}(\mathbf{x_i}, \mathbf{x_k}) & = (1 + \langle \mathbf{x_i}, \mathbf{x_k}\rangle) ^2 \\ & = \left(1 + \sum_{j = 1}^2 x_{ij}x_{kj} \right) ^2 \\ & = (1 + x_{i1}x_{k1} + x_{i2}x_{k2})^2 \\ & = 1 + x_{i1}^2x_{k1}^2 + x_{i2}^2x_{k2}^2 + 2x_{i1}x_{k1} + 2x_{i2}x_{k2} + 2x_{i1}x_{i2}x_{k1}x_{k2} \\ & = (1, x_{i1}^2, x_{i2}^2, \sqrt2x_{i1}, \sqrt2x_{i2}, \sqrt2x_{i1}x_{i2})^T(1, x_{k1}^2, x_{k2}^2, \sqrt2x_{k1}, \sqrt2x_{k2}, \sqrt2x_{k1}x_{k2}) \\ & = \langle \psi(\mathbf{x_i}), \psi(\mathbf{x_k}) \rangle \end{align*}\]
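If you want to convince yourself numerically, the identity can be checked with a few lines of code. The sketch below is plain Python rather than anything from the tutorial, and the two points are arbitrary illustrative values, not olive oil observations. The function names `poly_kernel`, `psi` and `dot` are made up for this example.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel (1 + <x, z>)^2 for 2-dimensional inputs."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2


def psi(x):
    """Explicit feature map for p = 2, matching the derivation above."""
    r2 = math.sqrt(2)
    return [1.0, x[0] ** 2, x[1] ** 2, r2 * x[0], r2 * x[1], r2 * x[0] * x[1]]


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Two arbitrary points, chosen only for illustration.
x_i = (0.5, -1.2)
x_k = (2.0, 0.3)

lhs = poly_kernel(x_i, x_k)    # kernel evaluated in the original 2-d space
rhs = dot(psi(x_i), psi(x_k))  # inner product in the 6-d transformed space

print(lhs, rhs)
print(math.isclose(lhs, rhs))
```

The kernel side needs one inner product in two dimensions, while the explicit side builds two six-dimensional vectors first. That saving is the point of the trick.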
Have a chat about why this algebraic “trick” is neat.
Fit the tree to the olive oils data, using only regions 2 and 3 and the predictors linoleic and arachidic, with a training split of 2/3. Report the balanced accuracy of the test set, and make a plot of the boundary.
The balanced accuracy is 0.99.
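As a reminder, balanced accuracy is the average of the per-class recalls, so the smaller class counts as much as the larger one. The sketch below is plain Python with made-up labels (not the olive oil test set) and a hypothetical helper name, `balanced_accuracy`. It only illustrates how the number is computed.

```python
def balanced_accuracy(truth, pred):
    """Mean of the per-class recalls (fraction of each true class predicted correctly)."""
    recalls = []
    for c in sorted(set(truth)):
        idx = [i for i, t in enumerate(truth) if t == c]
        correct = sum(1 for i in idx if pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical test-set labels: 4 observations of class 2, 10 of class 3.
truth = [2] * 4 + [3] * 10
# Hypothetical predictions: one class-2 observation is misclassified as class 3.
pred = [2, 2, 2, 3] + [3] * 10

accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
bal_acc = balanced_accuracy(truth, pred)

print(accuracy, bal_acc)
```

With unequal class sizes, a single error in the small class lowers balanced accuracy more than it lowers plain accuracy.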
Fit a random forest to the full data, using only linoleic and arachidic as predictors, report the balanced accuracy for the test set, and make a plot of the boundary.
The balanced accuracy is 1.
Explain the difference between the single tree and random forest boundaries.
The forest boundary is ever so slightly curved around the two clusters, whereas the tree boundary is a single split. Note, though, that with a different seed the boundary is sometimes boxy at the higher values of arachidic. Using a larger number of trees in the forest might stabilise the result.
Fit the random forest again to the full set of variables, and compute the variable importance. Describe the order of importance of variables.
#           2     3      MeanDecreaseAccuracy MeanDecreaseGini
# linoleic  0.53  0.333  0.408                58
# arachidic 0.10  0.044  0.066                21
The most important variables by far are linoleic and oleic, with arachidic much less important.
Create a new variable called linoarch that is \(0.377 \times \text{linoleic} + 0.926\times \text{arachidic}\). Make a plot of this variable against arachidic. Fit the tree model to the same training data using this variable in addition to linoleic and arachidic. Why doesn’t the tree use this new variable, even though it has a bigger difference between the two groups than linoleic? Change the order of the variables, so that linoarch is before linoleic, and re-fit the tree. Does it use this variable now? Why do you think this is?
Yes, it sees the new variable when the order is changed. The initial fit doesn’t see the better variable because both variables have a split with the same impurity value, and thus the first variable entered into the model is the one that is selected. We can see that the new variable is “better” than the first because there is a bigger gap between the two groups, but this isn’t a factor considered by a tree model.
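The tie can be reproduced on a toy example. The sketch below is plain Python with made-up values for two groups labelled "A" and "B" (not the olive oil data). The first-variable-wins rule in `first_best` is an assumption that mimics the behaviour described above, not the tree software's actual code. All function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split_impurity(values, labels):
    """Lowest weighted Gini impurity over all midpoint splits of one variable."""
    cuts = sorted(set(values))
    best = float("inf")
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v < threshold]
        right = [l for v, l in zip(values, labels) if v >= threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = min(best, weighted)
    return best


def gap(values, labels):
    """Distance between the groups along one variable (assumes "A" lies below "B")."""
    a = [v for v, l in zip(values, labels) if l == "A"]
    b = [v for v, l in zip(values, labels) if l == "B"]
    return min(b) - max(a)


def first_best(candidates, labels):
    """Pick the first variable, in the order given, with the lowest impurity."""
    best_name, best_imp = None, float("inf")
    for name, values in candidates:
        imp = best_split_impurity(values, labels)
        if imp < best_imp:  # strict '<' keeps the earlier variable on ties
            best_name, best_imp = name, imp
    return best_name


# Hypothetical data: three observations per group.
labels = ["A", "A", "A", "B", "B", "B"]
linoleic = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
arachidic = [0.2, 0.4, 0.3, 1.0, 1.2, 1.1]
linoarch = [0.377 * x + 0.926 * y for x, y in zip(linoleic, arachidic)]

print(best_split_impurity(linoleic, labels), best_split_impurity(linoarch, labels))
print(gap(linoleic, labels), gap(linoarch, labels))
print(first_best([("linoleic", linoleic), ("linoarch", linoarch)], labels))
print(first_best([("linoarch", linoarch), ("linoleic", linoleic)], labels))
```

Both variables reach the same best impurity, so the size of the gap never enters the comparison and the order of the variables decides which one is used.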
Fit the random forest again to the full set of variables, including linoarch and compute the variable importance. Describe the order of importance of variables. Does the forest see the new variable?
#           2      3      MeanDecreaseAccuracy MeanDecreaseGini
# linoleic  0.286  0.183  0.22                 35
# arachidic 0.053  0.015  0.03                 12
# linoarch  0.285  0.176  0.22                 32
Yes, the forest sees the new variable. However, it considers it to be only as important as linoleic. The forest doesn’t recognise that the new variable is better either, because it also does not consider the magnitude of the gap between groups.
Fit the linear SVM to the olive oils data, using only regions 2 and 3 and the predictors linoleic and arachidic, with a training split of 2/3. It can be helpful to standardise the variables before fitting the SVM, and then set scaled = FALSE as the argument to the fitting function.
Report the balanced accuracy of the test set, list the support vectors and their coefficients, and give the equation for the separating hyperplane in the form \[???\times\text{linoleic}+???\times\text{arachidic}+??? > 0\] Also make a plot of the boundary, overlaid by the data with the support vectors marked.
# Setting default kernel parameters
The coefficients \(\alpha_iy_i\), the indexes of the support vectors, the offset (reported as \(-\beta_0\), so \(\beta_0 = 1.2\)), the observations that are the support vectors, and their classes are:
# [[1]]
# [1] -5.2 2.7 2.4
# [1] 5 110 139
# [1] -1.2
# # A tibble: 3 x 2
# linoleic arachidic
# <dbl> <dbl>
# 1 0.614 0.351
# 2 -0.348 1.63
# 3 0.410 -1.40
# [1] 2 3 3
# Levels: 2 3
You need to use the formula
\[\mathbf{\beta}_j=\sum_{i=1}^s (\alpha_iy_i)\mathbf{x}_{ij}\]
to compute the remaining coefficients.
\(\beta_1 = -5.2\times0.614+2.7\times(-0.348)+2.4\times0.410 = -3.1484\)
\(\beta_2 = -5.2\times0.351+2.7\times1.63+2.4\times(-1.40) = -0.7842\)
which would give the equation of the separating hyperplane to be
\[1.2 - 3.1484\,\mbox{linoleic} - 0.7842\,\mbox{arachidic} = 0\]
and rearranging for arachidic as the focus gives
\[\mbox{arachidic} = \frac{1.2}{0.7842} - \frac{3.1484}{0.7842}\times\mbox{linoleic}\]
The slope of the line is -4.014792 and the intercept is 1.530222, which gives the line drawn on the plot above.
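The arithmetic can be checked with a short script. This is a plain Python sketch that copies the rounded values from the printed output, so the results carry the same rounding. The variable names are chosen for this example only.

```python
# Coefficients alpha_i * y_i for the three support vectors, as printed above.
coef = [-5.2, 2.7, 2.4]

# Standardised (linoleic, arachidic) values of the support vectors, as printed above.
support_vectors = [(0.614, 0.351), (-0.348, 1.63), (0.410, -1.40)]

# Intercept used in the hyperplane equation above.
beta0 = 1.2

# beta_j = sum_i (alpha_i y_i) x_ij
beta1 = sum(c * sv[0] for c, sv in zip(coef, support_vectors))
beta2 = sum(c * sv[1] for c, sv in zip(coef, support_vectors))

# Rearranging beta0 + beta1*linoleic + beta2*arachidic = 0 for arachidic.
slope = -beta1 / beta2
intercept = -beta0 / beta2

print(round(beta1, 4), round(beta2, 4))
print(round(slope, 6), round(intercept, 6))
```

Because the inputs are rounded to a few significant figures, a line computed from the unrounded model output may differ slightly in the later decimal places.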
Fit a radial kernel SVM, with a variety of cost values, to examine the effect on the boundary.
The boundary is always a small circle wrapping the smaller group. Very low values of cost break the fit, though, and give poor prediction of the smaller group.
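One way to see why the boundary tends to be round: a radial kernel depends only on the distance between two points. The plain Python sketch below assumes the common form \(\exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)\). The parameter name `gamma` and the single support vector at the origin are assumptions for illustration, and software packages may parameterise the kernel differently.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Radial kernel exp(-gamma * ||x - z||^2); gamma is an assumed parameter name."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# A single hypothetical support vector at the origin.
centre = (0.0, 0.0)

# Four points that all lie at distance 1 from the centre.
r = math.sqrt(0.5)
on_circle = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (r, r)]

# One point further away, at distance 2.
further = (2.0, 0.0)

circle_values = [rbf_kernel(centre, p) for p in on_circle]
further_value = rbf_kernel(centre, further)

print(circle_values)
print(further_value)
```

Points at the same distance from the support vector get the same kernel value, so the level sets of a single kernel term are circles. A boundary built from a few nearby support vectors in the smaller group therefore wraps that group.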